FIN-450
Midterm Exam
Questions
Q1. Use of API and Data Parsing. (5 points)
Extract the following data from FRED from 2014-01-01 to 2024-08-01.
- Unemployment Rate in Alaska.
- Unemployment Rate in District of Columbia.
Perform the following:
- Return a long data frame named `unemployment` with columns `date`, `series` and `value`.
- Amend the series names to `AK` and `DC` using `tidyverse` functions.
- Using `plotly`, plot both time series on the same line chart. `ggplot2` converted to `plotly` is not acceptable.

```r
# your code here
# Load the libraries used throughout the exam
library(tidyverse)
library(tidyquant)
library(knitr)
library(gt)

# FRED series IDs: Alaska (AKUR) and District of Columbia (DCUR) unemployment rates
country <- c("AKUR", "DCUR")

unemployment <- country %>%
  tidyquant::tq_get(get = "economic.data",
                    from = "2014-01-01",
                    to = "2024-08-01") %>%
  stats::na.omit()

# Rename the series to AK/DC and keep a tidy long data frame
unemployment <- unemployment %>%
  dplyr::mutate(series = stringr::str_replace_all(symbol, c("AKUR" = "AK",
                                                            "DCUR" = "DC")),
                value = price) %>%
  dplyr::select(date, series, value)

# Plot both series on one interactive plotly line chart
unemployment %>%
  plotly::plot_ly(x = ~date, y = ~value, name = ~series,
                  type = "scatter", mode = "lines")
```
Q2. Data Wrangling (15 points)
A) Using the sp400 components dataset below: (5 pts)
- Return a dataframe that shows the `weight` of each `sector` in the S&P 400.
- Round the `weight` to 3 decimals.
- Sort the sectors descending by weight.

```r
# hint: the output of your code should return a dataframe looking like the output of this one
example <- dplyr::tibble(sector = c("Information Technology", "Consumer Discretionary", "Utilities"),
                         weight = c(0.15, 0.10, 0.05))
kable(example)
```

| sector | weight |
|---|---|
| Information Technology | 0.15 |
| Consumer Discretionary | 0.10 |
| Utilities | 0.05 |
```r
library(RTLedu)
sp <- sp400_desc

# your code here
# Sum the company weights within each sector (the question asks for the
# weight of each sector, so the weights must be aggregated), round to
# 3 decimals, and sort descending
weight <- sp %>%
  dplyr::group_by(sector) %>%
  dplyr::summarise(weight = sum(weight)) %>%
  dplyr::mutate(weight = round(weight, 3)) %>%
  dplyr::arrange(dplyr::desc(weight))
weight
```
B) What is the total weight of the 15 companies with the largest weights in the sp400? (5 pts)
For example, if each of the 10 largest weight companies had a weight of 1%, it would be 10%.
```r
# your code here
# Keep the 15 largest weights, accumulate them, and read off the total
top15 <- sp %>%
  dplyr::select(company, weight) %>%
  dplyr::arrange(dplyr::desc(weight)) %>%
  dplyr::slice(1:15) %>%
  dplyr::mutate(answer = cumsum(weight))

answer <- top15 %>% dplyr::slice(15) %>% dplyr::select(answer)
gt(answer)
```

| answer |
|---|
| 0.09274607 |
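The cumulative-sum route above works; for reference, the same total can be computed in one pipe. A sketch assuming the same `sp` data frame (`with_ties = FALSE` guarantees exactly 15 rows even if weights tie at the cutoff):

```r
# Equivalent approach: keep the 15 largest weights, then sum them directly
sp %>%
  dplyr::slice_max(weight, n = 15, with_ties = FALSE) %>%
  dplyr::summarise(answer = sum(weight))
```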
C) Searching (5 points)
You want to extract companies with the following criteria:
- They are either in the Health Care OR Communication Services `sector`,
- AND they have a `weight` greater than 0.4%.

Correct the code I wrote, which is not working…

```r
# leave this code as is and correct it in the next chunk
sp400_desc %>% tidyr::select(sector == "Communication Services" AND sector == "Health Care" OR weight > 0.004)
```

```r
# Your corrected code here
# Use filter() (not select()); OR across the two sectors, AND with the
# weight condition
corrected <- sp400_desc %>%
  dplyr::filter(sector %in% c("Communication Services", "Health Care"),
                weight > 0.004)
gt(corrected)
```

| symbol | company | identifier | sedol | weight | sector | shares_held | local_currency |
|---|---|---|---|---|---|---|---|
| ILMN | Illumina Inc. | 452327109 | 2613990 | 0.007700409 | Health Care | 1262275 | USD |
| AVTR | Avantor Inc. | 05352A100 | BJLT387 | 0.005408540 | Health Care | 5386949 | USD |
| UTHR | United Therapeutics Corporation | 91307C102 | 2430412 | 0.005326670 | Health Care | 352542 | USD |
| THC | Tenet Healthcare Corporation | 88033G407 | B8DMK08 | 0.004949650 | Health Care | 759273 | USD |
| BMRN | BioMarin Pharmaceutical Inc. | 09061G101 | 2437071 | 0.004581570 | Health Care | 1508557 | USD |
| SRPT | Sarepta Therapeutics Inc. | 803607100 | B8DPDT7 | 0.004304370 | Health Care | 755664 | USD |
Q3. Correlation (6 points)
You just graduated in Finance and took a job as an investment adviser for a company specializing in the real estate sector. Your company runs advertising portraying the diversification benefit it provides at all times versus equity indices.
You are skeptical.
- Use the following data set, which represents prices of an ETF `RealEstate` and of `sp400`.
- Use `log()` returns for your analysis.

```r
# Compute log returns per series and drop the leading NA
cor <- RTLedu::correlation %>%
  dplyr::group_by(series) %>%
  dplyr::mutate(log_return = log(value / dplyr::lag(value))) %>%
  tidyr::drop_na(log_return)
```

A) Compute and plot a 60-day rolling correlation. (3 points)
```r
# your code here
# Pivot to wide, then compute the rolling correlation with slider.
# .before = 60 with .after = 0 uses the current day plus the 60 preceding
# observations (a 61-point window); use .before = 59 for a strict 60-day window.
cor.roll <- cor %>%
  dplyr::select(date, series, log_return) %>%
  tidyr::pivot_wider(names_from = series, values_from = log_return) %>%
  dplyr::mutate(cor60 = slider::pslide_dbl(
    .l = list(RealEstate, sp400),
    .f = ~ cor(.x, .y),
    .before = 60,
    .after = 0,
    .complete = TRUE
  )) %>%
  tidyr::drop_na()

cor.roll %>%
  ggplot(aes(x = date, y = cor60)) +
  geom_line(color = "blue") +
  labs(title = "60-day Rolling Correlation", x = "", y = "")
```

B) Compute the AVERAGE of the ROLLING correlation in the following two periods and select the appropriate TRUE statement(s). (3 points)
- Pre COVID19: 2017-2019.
- Post COVID19: 2020-now.
For full points, you must create a variable in your dataframe using dplyr::mutate() with the pre and post correlation periods (tidy workflow).
```r
# your code here
# Tag each observation as pre- or post-COVID19, then average the rolling
# correlation within each period
roll_cor <- cor.roll %>%
  dplyr::mutate(periods = dplyr::if_else(date < as.Date("2020-01-01"),
                                         "Pre-COVID19", "Post-COVID19")) %>%
  dplyr::group_by(periods) %>%
  dplyr::summarise(avg_roll_cor = mean(cor60))
kable(roll_cor, digits = 3)
```

| periods | avg_roll_cor |
|---|---|
| Post-COVID19 | 0.687 |
| Pre-COVID19 | 0.490 |
- Pre-COVID19 = 0.49, Post-COVID19 = 0.69
- Pre-COVID19 = 0.52, Post-COVID19 = 0.69
- Pre-COVID19 = 0.49, Post-COVID19 = 0.81
- Pre-COVID19 = 0.52, Post-COVID19 = 0.81
- Pre-COVID19 = 0.49, Post-COVID19 = 0.687 = 0.69
your answer
A
Q4 Seasonality (6 points)
Using the RTLedu::unemployment data set:
A) STL decomposition (4 points)
In the code chunk below: Use the feasts::STL() model and plot the results using fabletools::components().
Add a short paragraph telling me what you observe in the change over time in their seasonality patterns.
From the chart, the unemployment rates in the three states follow broadly similar trends. Alaska shows more subtle increases and decreases over the years, while California has the most pronounced swings. On seasonality, Alaska has the most noticeable seasonal pattern, and it interestingly weakens over time, while California and New Jersey share a very similar pattern whose seasonal amplitude increases over the years. The remainder component of the STL decomposition captures the effect of economic shocks: there is a significant jump in the unemployment rate at the beginning of 2020 (COVID-19).
```r
# your code here
library(fabletools)
library(feasts)
library(tsibble)

seas <- RTLedu::unemployment

# Build a monthly tsibble keyed by state
seas_tsi <- seas %>%
  tsibble::as_tsibble(key = state, index = date) %>%
  tsibble::index_by(freq = ~ yearmonth(.)) %>%
  tsibble::group_by_key() %>%
  dplyr::summarise(rate = mean(rate), .groups = "keep") %>%
  stats::na.omit()

# Fit an STL decomposition and plot its components
stl <- seas_tsi %>%
  fabletools::model(feasts::STL(formula = rate ~ season(window = 13)))

stl %>% fabletools::components() %>% autoplot()
```

B) Compute the Trend and Seasonality strength statistics. (2 points)
```r
# your code here
# STL feature statistics, including trend and seasonal strength
str_stats <- seas_tsi %>%
  fabletools::features(rate, feasts::feat_stl)
kable(str_stats, digits = 3)
```

| state | trend_strength | seasonal_strength_year | seasonal_peak_year | seasonal_trough_year | spikiness | linearity | curvature | stl_e_acf1 | stl_e_acf10 |
|---|---|---|---|---|---|---|---|---|---|
| Alaska | 0.856 | 0.656 | 2 | 8 | 0 | -0.097 | -0.084 | 0.667 | 0.978 |
| California | 0.931 | 0.236 | 4 | 11 | 0 | -0.064 | -0.204 | 0.670 | 0.818 |
| NewJersey | 0.877 | 0.281 | 7 | 11 | 0 | -0.012 | -0.165 | 0.677 | 0.982 |
Q5. Regression Analysis (8 points)
A) Hedging: Perform a regression and select the TRUE statement(s) (4 points)
This question will use the RTLedu::reg3 data set where:
- `ICLN` is a clean energy ETF.
- `XLE` is the Energy industry ETF of the sp500 index.
- You own `ICLN` in your portfolio.
- You are interested in understanding how `XLE` returns explain `ICLN` returns.
- No residuals or ACF tests are required for this question.

```r
reg1 <- RTLedu::reg3
# your code here
library(broom)

# Regress ICLN returns on XLE returns
fit <- stats::lm(ICLN ~ XLE, data = reg1)
model_fit <- broom::tidy(fit)
kable(model_fit, digits = 3)
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.000 | 0.000 | 0.453 | 0.651 |
| XLE | 0.437 | 0.016 | 27.879 | 0.000 |
```r
# Minimum-variance hedge ratio; for a univariate regression this equals the
# OLS beta of ICLN on XLE
hedge_ratio <- cor(reg1$ICLN, reg1$XLE) * (sd(reg1$ICLN) / sd(reg1$XLE))
```

- The regression and beta (coefficient estimate) are significant.
- The beta (coefficient estimate) is significant and the regression is not.
- To hedge your thousand dollar investment in `ICLN`, you should sell approximately $450 of `XLE` shares.
- To hedge your thousand dollar investment in `ICLN`, you should sell approximately $550 of `XLE` shares.
your answer(s)
A, C (with one regressor, the regression F-statistic is the square of the XLE t-statistic, so the regression as a whole is significant along with the beta)
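The dollar-hedge options follow from scaling the hedge ratio by the position size; a quick check, using the XLE beta of 0.437 reported in the regression table above:

```r
# Dollar hedge = position size * hedge ratio; for univariate OLS the hedge
# ratio equals the regression beta
position    <- 1000
hedge_ratio <- 0.437    # XLE beta from the regression output above
position * hedge_ratio  # ~437, i.e. sell approximately $450 of XLE
```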
B) Regression Residuals (4 points)
A work colleague has done the regression shown below.
Her boss knows you have a Finance background and asking you for your critical opinion.
Write a few bullet points summarizing your conclusions.
- The coefficient on x is significant and the model explains 76.31% of the variability; however, the residuals tell another story.
- The residuals vs. fitted plot shows a curved pattern, i.e. clear non-linearity: the model does not fully capture the true relationship. A non-linear specification, perhaps a polynomial (e.g. cubic) term, would be worth trying.
- The normal Q-Q plot shows the residuals are not normally distributed, at least in the tails.
- A few observations (82, 209, 20, 19) may be unduly influential; this should be re-examined after transforming the model.
- The test reported below is a Breusch-Godfrey test, and its p-value of 0.022 indicates serial correlation in the residuals (not heteroscedasticity, which would require a Breusch-Pagan test).
```r
library(ggfortify)

reg <- lm(y ~ x, data = RTLedu::reg2)

# Scatterplot of the raw data
RTLedu::reg2 %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(alpha = 0.6, color = "blue") +
  labs(title = "Scatterplot of our Data", x = "x", y = "y") +
  theme_minimal()

results <- summary(reg)
model_results <- broom::tidy(results)
kable(model_results, digits = 3)
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.461 | 0.277 | 1.664 | 0.099 |
| x | 0.677 | 0.038 | 17.770 | 0.000 |

```r
# Diagnostic plots for the regression
autoplot(reg, size = 0.5)

# Breusch-Godfrey test for serial correlation (run on reg, this question's
# model, not on the fit object from Q5A)
test_results <- lmtest::bgtest(reg) %>% broom::tidy()
kable(test_results)
```

| statistic | p.value | parameter | method |
|---|---|---|---|
| 5.224175 | 0.022275 | 1 | Breusch-Godfrey test for serial correlation of order up to 1 |
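The Breusch-Godfrey statistic above tests serial correlation; heteroscedasticity would instead be checked with a Breusch-Pagan test. A minimal sketch, assuming the same `reg` model and that the `lmtest` package is installed:

```r
# Breusch-Pagan test: regresses the squared residuals on the model's
# regressors; a small p-value suggests heteroscedasticity
lmtest::bptest(reg) %>% broom::tidy()
```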